Combining Sentence Length with Location Information to Align Monolingual Parallel Texts

نویسندگان

  • Weigang Li
  • Ting Liu
  • Sheng Li
چکیده

Abundant Chinese paraphrasing resource on Internet can be attained from different Chinese translations of one foreign masterpiece. Paraphrases corpus is the corpus that includes sentence pairs to convey the same information. The irregular characteristics of the real monolingual parallel texts, especially without the strictly aligned paragraph boundaries between two translations, bring a challenge to alignment technology. The traditional alignment methods on bilingual texts have some difficulties in competency for doing this. A new method for aligning real monolingual parallel texts using sentence pair’s length and location information is described in this paper. The model was motivated by the observation that the location of a sentence pair with certain length is distributed in the whole text similarly. And presently, a paraphrases corpus with about fifty thousand sentence pairs is constructed.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Alignment of English-Chinese Bilingual Texts of CNS News

In this paper we address a method to align EnglishChinese bilingual news reports from China News Service, combining both lexical and statistical approaches. Because of the sentential structure differences between English and Chinese, matching at the sentence level as in many other works may result in frequent matching of several sentences en masse. In view of this, the current work also attempt...

متن کامل

MT-based Sentence Alignment for OCR-generated Parallel Texts

The performance of current sentence alignment tools varies according to the to-bealigned texts. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments which are used as anchor points. The gaps between these anchor p...

متن کامل

Acquiring Synonyms from Monolingual Comparable Texts

This paper presents a method for acquiring synonyms from monolingual comparable text (MCT). MCT denotes a set of monolingual texts whose contents are similar and can be obtained automatically. Our acquisition method takes advantage of a characteristic of MCT that included words and their relations are confined. Our method uses contextual information of surrounding one word on each side of the t...

متن کامل

A Keyword-based Monolingual Sentence Aligner in Text Simplification

We introduce a method for learning to align sentences in monolingual parallel articles for text simplification. In our approach, word keyness is integrated to prefer aligning essential words in sentences. The method involves estimating word keyness based on TF*IDF and semantic PageRank, and word nodes’ parts-of-speech and degrees of reference. At run-time, the keyword analyses are used as word ...

متن کامل

Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation

Information Extraction (IE) is becoming increasingly useful, but it is a costly task to discover and annotate novel events, event arguments, and event types. We exploit both monolingual texts and bilingual sentence-aligned parallel texts to cluster event triggers and discover novel event types. We then generate event argument annotations semiautomatically, framed as a sentence ranking and seman...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004